Abstract
In this document we will apply Seurat pipeline to 10X snRNAseq data from human and chimp skeletal muscles cells. More specifically, we will check whether we find clustering with respect to skeletal muscle type 1 and type 2A / 2X fibrotypes. This analysis was published in Skeletal Muscle in 2022, https://skeletalmusclejournal.biomedcentral.com/articles/10.1186/s13395-022-00299-4, if you use the data for your research, we kindly ask you to cite the study.
Let us start with loading packages:
library("Seurat")
## Attaching SeuratObject
## Attaching sp
library("dplyr")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library("data.table")
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
library("matrixStats")
##
## Attaching package: 'matrixStats'
## The following object is masked from 'package:dplyr':
##
## count
library("DoubletFinder")
Now we will load Skeletal Muscle scRNAseq un-normalized data and have a look:
expr <- suppressWarnings(as.data.frame(fread("unzip -p SkeletalMuscle_HumanChimp_10X.zip",sep="\t")))
rownames(expr)<-expr$V1; expr$V1<-NULL;
expr <- expr[rowSums(expr) != 0,]
#expr <- expr[rowMeans(as.matrix(expr)) >= 0.1,]
expr[1:5,1:5]
## C00001_chimp C00004_human C00005_human C00006_chimp C00007_chimp
## RP11.34P13.7 0 0 0 0 0
## FO538757.2 0 0 0 0 0
## AP006222.2 0 0 0 0 0
## RP4.669L17.10 0 0 0 0 0
## RP5.857K21.4 0 0 0 0 0
print(paste0("DATASET CONTAINS ",dim(expr)[1]," GENES AND ",dim(expr)[2]," CELLS"))
## [1] "DATASET CONTAINS 18104 GENES AND 6578 CELLS"
First we are going to look at which genes are most expressed in our data set:
head(sort(rowSums(expr), decreasing=TRUE), 20)
## MALAT1 NEAT1 TTN NEB ACTA1 EMC10 TNNC2
## 1017013 53361 45022 18865 18366 16743 14244
## MYBPC1 TRDN TNNT1 TNNT3 MIR133A1HG TCAP MYH7
## 11994 10839 10835 10462 7948 6867 6432
## ABCA5 TPM2 PDLIM3 MB NEXN TNNC1
## 6329 6142 5901 5753 5634 5526
barplot(log10(as.numeric(sort(rowSums(expr), decreasing=TRUE)) + 1), xlim=c(0,20), names = names(sort(rowSums(expr), decreasing=TRUE)),ylab="LOG10(EXPR)",las=2)
MALAT1 is a nuclear lincRNA which is very abundant. Typically in scRNAseq experiments, high expression of MALAT1 indicates that the cellular membrane is broken and cytoplasmic mRNA starts leaking out. Some studies show that MALAT1 expression is correlated with the mitochondrial contamination. However, in our case of snRNAseq, high expression of MALAT1 is to be expected. Still common practice is to remove this gene from the data set because of unknown reasons for such a high expression. Additional reason is that high expression of MALAT1 can be hard for normalization algorithms. So here we are going to remove MALAT1 genes from the further downstream analysis:
expr<-expr[-which(rownames(expr)=="MALAT1"),]
What about ribosomal gene families expression (RPS and RPL genes)? Their expression is also considered to be a contamination. Let us first display the ribosomal genes:
rownames(expr)[grepl(paste0(c("^RPS","^RPL"),collapse="|"),rownames(expr))]
## [1] "RPL22" "RPL11" "RPS6KA1" "RPS8" "RPL5" "RPS27"
## [7] "RPS6KC1" "RPS7" "RPS27A" "RPL31" "RPL37A" "RPL32"
## [13] "RPL15" "RPSA" "RPL14" "RPL29" "RPL24" "RPL22L1"
## [19] "RPL39L" "RPL35A" "RPL9" "RPL34.AS1" "RPL34" "RPS3A"
## [25] "RPL37" "RPS23" "RPS14" "RPL26L1" "RPS18" "RPS10"
## [31] "RPL10A" "RPL7L1" "RPS12" "RPS6KA2" "RPS6KA3" "RPS4X"
## [37] "RPS6KA6" "RPL36A" "RPL39" "RPL10" "RPS20" "RPL7"
## [43] "RPL30" "RPL8" "RPS6" "RPL35" "RPL12" "RPL7A"
## [49] "RPLP2" "RPL27A" "RPS13" "RPS6KA4" "RPS6KB2" "RPS3"
## [55] "RPS25" "RPS24" "RPS26" "RPL41" "RPL6" "RPLP0"
## [61] "RPL21" "RPS29" "RPL36AL" "RPS6KL1" "RPS6KA5" "RPS27L"
## [67] "RPL4" "RPLP1" "RPS17" "RPL3L" "RPS2" "RPS15A"
## [73] "RPL13" "RPL26" "RPL23A" "RPL23" "RPL19" "RPL27"
## [79] "RPS6KB1" "RPL38" "RPL17" "RPS21" "RPS15" "RPL36"
## [85] "RPS28" "RPL18A" "RPS16" "RPS19" "RPL18" "RPL13A"
## [91] "RPS11" "RPS9" "RPL28" "RPS5" "RPS4Y1" "RPS4Y2"
## [97] "RPL3" "RPS19BP1"
Let us now rank all cells by the percentage of ribosomal genes expressed in them:
ribosom_genes_expr_matrix <- expr[rownames(expr)[grepl(paste0(c("^RPS","^RPL"),collapse="|"),rownames(expr))],]
percent_ribosom_expr <- colSums(ribosom_genes_expr_matrix) / colSums(expr)
head(sort(percent_ribosom_expr, decreasing=TRUE), 20)
## C00558_human C06743_human C06698_human C05295_human C09079_human C08692_human
## 0.2023121 0.1824818 0.1698113 0.1634241 0.1579618 0.1562500
## C09111_human C06378_human C05025_human C07328_human C01087_human C06977_human
## 0.1546961 0.1538462 0.1525626 0.1504425 0.1500000 0.1481481
## C06938_human C01750_human C02122_human C01186_chimp C05326_chimp C09223_human
## 0.1455696 0.1453397 0.1438596 0.1433447 0.1428571 0.1402878
## C07894_human C01150_human
## 0.1388889 0.1382488
barplot(sort(percent_ribosom_expr, decreasing=TRUE), xlim=c(0,1000),las=2)
It looks like we have quite a few nuceli with high (up to 19%) of their UMI due to ribosomal gene families. What can it mean? Well, ribosomal proteins are house keeping genes that may not be clearly related to cell type. The ribosomal genes are often removed to make clustering of cell types more transparrent as they have more or less similar expression across cells and thus do not contribute to cell type identification. In this analysis we will remove the ribosomal protein genes:
expr<-expr[grepl(paste0(c("^RPS","^RPL"),collapse="|"),rownames(expr))==FALSE,]
Now let us display mitochondrial genes:
rownames(expr)[grepl("^MT\\.",rownames(expr))]
## [1] "MT.ND1" "MT.ND2" "MT.CO1" "MT.CO2" "MT.ATP8" "MT.ATP6" "MT.CO3"
## [8] "MT.ND3" "MT.ND4L" "MT.ND4" "MT.ND5" "MT.ND6" "MT.CYB"
Presence of those genes usually imply contamination, i.e. when cell membrane is broken, the nuclear RNA floats away but the mitochondrial RNA is still in tact. In our case, if we see nuclei with mitochondrial genes expressed, this would mean that mitochondrial got stuck to the nuclei somehow during the nuclei library prep. Anyhow, nuclei with mitochondrial genes expressed should be filtered out. Let us rank all cell by the percentage of mitochondrial genes expressed:
mito_genes_expr_matrix <- expr[rownames(expr)[grepl("^MT\\.",rownames(expr))],]
percent_mito_expr <- colSums(mito_genes_expr_matrix) / colSums(expr)
head(sort(percent_mito_expr, decreasing=TRUE), 20)
## C05741_human C01562_human C05007_human C05938_human C07894_human C08074_chimp
## 0.2860360 0.2067039 0.1985112 0.1927711 0.1720430 0.1638225
## C08927_chimp C07328_human C04358_human C07646_human C08821_chimp C08692_human
## 0.1577947 0.1562500 0.1530612 0.1530055 0.1495327 0.1481481
## C09347_human C09682_human C07050_human C06743_human C00533_chimp C00251_chimp
## 0.1476510 0.1465517 0.1448276 0.1428571 0.1340782 0.1319149
## C08516_human C06620_human
## 0.1318182 0.1317829
barplot(sort(percent_mito_expr, decreasing=TRUE), xlim=c(0,1000),las=2)
It seems that quite a few cells have high (up to 25%) percentage of their library sizes due to mitochndrial contamination. Let us display the names of the cells with >5% of their library size contributed by mitochondrial gene expression.
names(percent_mito_expr[percent_mito_expr>0.05])
## [1] "C00077_human" "C00251_chimp" "C00371_human" "C00403_human" "C00493_chimp"
## [6] "C00499_human" "C00502_human" "C00533_chimp" "C00677_human" "C00951_human"
## [11] "C01549_human" "C01562_human" "C01671_chimp" "C01718_human" "C01750_human"
## [16] "C01900_chimp" "C01929_human" "C02064_human" "C02088_human" "C02173_human"
## [21] "C02203_human" "C02669_human" "C02704_human" "C02860_human" "C03073_human"
## [26] "C03148_chimp" "C03175_chimp" "C03306_chimp" "C03317_human" "C03343_chimp"
## [31] "C03448_human" "C03589_human" "C03745_chimp" "C03847_human" "C03885_chimp"
## [36] "C04037_human" "C04070_human" "C04107_human" "C04317_chimp" "C04358_human"
## [41] "C04425_chimp" "C04713_human" "C04772_chimp" "C04816_human" "C04817_human"
## [46] "C05007_human" "C05012_human" "C05025_human" "C05068_human" "C05082_human"
## [51] "C05295_human" "C05305_human" "C05492_human" "C05719_human" "C05741_human"
## [56] "C05757_human" "C05833_chimp" "C05938_human" "C05992_chimp" "C06000_chimp"
## [61] "C06034_chimp" "C06136_chimp" "C06196_chimp" "C06245_human" "C06406_human"
## [66] "C06420_human" "C06435_chimp" "C06444_human" "C06578_human" "C06620_human"
## [71] "C06698_human" "C06720_human" "C06724_human" "C06734_human" "C06743_human"
## [76] "C06822_human" "C06841_human" "C06938_human" "C06996_human" "C07050_human"
## [81] "C07127_human" "C07182_chimp" "C07239_chimp" "C07274_chimp" "C07328_human"
## [86] "C07411_human" "C07418_human" "C07434_human" "C07448_chimp" "C07626_human"
## [91] "C07636_chimp" "C07646_human" "C07672_human" "C07863_human" "C07894_human"
## [96] "C07921_human" "C07979_human" "C08026_human" "C08074_chimp" "C08101_human"
## [101] "C08182_chimp" "C08217_human" "C08428_chimp" "C08435_human" "C08454_human"
## [106] "C08516_human" "C08543_human" "C08572_chimp" "C08651_human" "C08692_human"
## [111] "C08821_chimp" "C08915_human" "C08927_chimp" "C09079_human" "C09111_human"
## [116] "C09171_human" "C09223_human" "C09336_human" "C09347_human" "C09361_human"
## [121] "C09431_human" "C09511_human" "C09536_human" "C09598_human" "C09659_chimp"
## [126] "C09682_human" "C09708_chimp" "C09757_human" "C09860_human" "C09867_human"
sum(grepl("human",names(percent_mito_expr[percent_mito_expr>0.05]))==TRUE)/sum(grepl("human",names(percent_mito_expr)))
## [1] 0.03270869
sum(grepl("chimp",names(percent_mito_expr[percent_mito_expr>0.05]))==TRUE)/sum(grepl("chimp",names(percent_mito_expr)))
## [1] 0.009332967
We can see that the majority of contaminated cells come from the human, i.e. approximately 3% of human cells are contaminated while only about 1% of chimp cells demonstrate high mitochondrial gene expression. Here we are going to filter out the contaminated nuclei with >1% of their UMIs due to mitochondrial genes:
expr[,names(percent_mito_expr[percent_mito_expr>0.01])] <- NULL
Now everything is ready for running the Seurat pipeline. We should not see lots of cells with mitochondrial contamination at the end.
The steps below encompass the standard pre-processing workflow for scRNA-seq data in Seurat. These represent the creation of a Seurat object, the selection and filtration of cells based on QC metrics, data normalization and scaling, and the detection of highly variable features.
In order to proceed with Seurat pipeine we will have to convert the un-normalized expression matrix into Seurat object. We will keep all features expressed in >= 1 cells (~0.02% of the data) and keep all cells with at least 50 detected features:
skm <- CreateSeuratObject(counts = expr, project = "SKM", min.cells = 1, min.features = 0, meta.data=data.frame(row.names=colnames(expr),orig.ident=matrix(unlist(strsplit(colnames(expr),"_")),ncol=2,byrow=TRUE)[,2]))
## Warning: Feature names cannot have underscores ('_'), replacing with dashes
## ('-')
skm@active.ident<-as.factor(skm$orig.ident)
skm
## An object of class Seurat
## 17676 features across 5668 samples within 1 assay
## Active assay: RNA (17676 features, 0 variable features)
The number of features and UMIs (nFeature_RNA and nCount_RNA) are automatically calculated for every object by Seurat. For non-UMI data, nCount_RNA represents the sum of the non-normalized values within a cell i.e. library size. nFeature_RNA is a number of non-zero counts per cell. We calculate the percentage of mitochondrial features here and store it in object metadata as percent.mito. We use raw count data since this represents non-transformed and non-log-normalized counts. The % of UMI mapping to MT-features is a common scRNA-seq QC metric.
mito.features <- grep(pattern = "^MT\\.", x = rownames(x = skm), value = TRUE)
# Calculate % of mitochondrial genes per cell
percent.mito <- Matrix::colSums(x = GetAssayData(object = skm, slot = 'counts')[mito.features, ]) / Matrix::colSums(x = GetAssayData(object = skm, slot = 'counts'))
# The [[ operator can add columns to object metadata, and is a great place to stash QC stats
skm[['percent.mito']] <- percent.mito
VlnPlot(object = skm, features = c("nFeature_RNA", "nCount_RNA", "percent.mito"), ncol = 3)
Thus we have in average 200 genes detected per cell while a typical library size is 400-450. This is not a fantastic coverage but still tolerable for 10X data. Good news is that we do not seem to have many detected mitochondrial genes, i.e. they contribute less than 1% of the library sizes of the cells. Now we will check how nFeature_RNA is connected with nCount_RNA:
FeatureScatter(object = skm, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
We conclude that larger library sizes (nCount_RNA) give more detected genes per cell (nFeature_RNA).
Now we will normalize the data. By default, Seurat employs a global-scaling normalization method “LogNormalize” that normalizes the feature expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 by default), and log-transforms the result. However, one can also use a more recent and intelligent SCTransform normalization, https://pubmed.ncbi.nlm.nih.gov/31870423/, that regresses out the dependence of gene expression of sequencing depth individually for lowly, moderately and highly expressed genes.
skm <- NormalizeData(object = skm, normalization.method = "LogNormalize", scale.factor = 1e4)
# store mitochondrial percentage in object meta data
#skm <- PercentageFeatureSet(skm, pattern = "^MT-", col.name = "percent.mt")
# run sctransform
#skm <- SCTransform(skm, vars.to.regress = "percent.mt", verbose = TRUE)
Now we will identify Highly Variable Genes (HVGs) wich will be used for dimensionality reduction. FindVariableFeatures calculates the average expression and dispersion for each feature, places these features into bins, and then calculates a z-score for dispersion within each bin. This helps control for the relationship between variability and average expression.
skm <- FindVariableFeatures(object = skm, nfeatures = dim(skm)[1])
length(x = VariableFeatures(object = skm))
## [1] 17676
Thus we have 17676 Highly Variable Genes (HVGs) which we will use for further downstream analysis.
The single cell dataset likely contains "uninteresting" sources of variation. This could include not only technical noise, but batch effects, or even biological sources of variation (cell cycle stage). To mitigate the effect of these signals, Seurat constructs linear models to predict feature expression based on user-defined variables. The scaled z-scored residuals of these models are stored in the scale.data slot, and are used for dimensionality reduction and clustering.
We can regress out cell-cell variation in feature expression driven by batch (if applicable), cell alignment rate (as provided by Drop-seq tools for Drop-seq data), the number of detected molecules, and mitochondrial feature expression. For cycling cells, we can also learn a "cell-cycle" score and regress this out as well. Here we regress on the number of detected molecules per cell.
skm <- ScaleData(object = skm, features = rownames(x = skm), vars.to.regress = c("nCount_RNA","percent.mito"))
## Regressing out nCount_RNA, percent.mito
## Centering and scaling data matrix
Next we will perform linear Dimensionality Reduction which is a Principal Component Analysis (PCA) in this case:
skm <- RunPCA(object = skm, features = VariableFeatures(object = skm), verbose = FALSE)
We can see genes contributing to the loadings of the principal components. Let us now visualize loadings for the first two PCs:
print(x = skm[['pca']], dims = 1:5, nfeatures = 5, projected = FALSE)
## PC_ 1
## Positive: EMC10, CTD.2545M3.8, MYBPC1, MIR1.1HG, AC074117.10
## Negative: ABCA5, TRIM63, KLHL24, RBFOX1, ABCA10
## PC_ 2
## Positive: NOVA1, APOD, TNXB, ABCA8, COL6A1
## Negative: TTN, NEB, NEAT1, TRDN, MIR133A1HG
## PC_ 3
## Positive: MYH7B, TNNT1, ATP2A2, TPM3, XPO4
## Negative: TNNT3, ATP2A1, TPM1, MYH1, MYH2
## PC_ 4
## Positive: NOVA1, ABCA8, APOD, TNXB, CFD
## Negative: HLA.B, EGFL7, HIF3A, HLA.E, B2M
## PC_ 5
## Positive: NEXN, TECRL, PDK4, PFKFB3, TRIM63
## Negative: ATP2A1, MYLPF, TRDN, GGT7, MYH2
VizDimLoadings(object = skm, dims = 1:2)
and display the PCA plot itself:
DimPlot(object = skm, group.by="orig.ident", pt.size=1.2)
skm <- ProjectDim(object = skm)
## PC_ 1
## Positive: EMC10, CTD.2545M3.8, MYBPC1, MIR1.1HG, AC074117.10, ACTA1, RHOBTB1, CMYA5, MYOT, MYOM2
## COL4A3, RP11.227H15.4, MIR133A1HG, TPM2, LINC01116, TXNIP, MYH2, EHBP1L1, MYF6, DDX5
## Negative: ABCA5, TRIM63, KLHL24, RBFOX1, ABCA10, FBXO32, ACTN3, RP11.166B2.1, PFKFB3, NPIPB15
## SLC35G3, RP11.894P9.1, RIF1, AC005523.2, RP5.947P14.1, AAK1, LPIN1, FAM134B, ABCC5, PLIN5
## PC_ 2
## Positive: NOVA1, APOD, TNXB, ABCA8, COL6A1, GSN, COL6A2, CFD, DCN, MEG3
## COL4A1, SERPING1, SMOC2, HSPG2, LTBP4, SPARCL1, CALD1, LRP1, PRELP, PDGFRB
## Negative: TTN, NEB, NEAT1, TRDN, MIR133A1HG, MYBPC1, EMC10, NEXN, ABCA5, PDLIM3
## FBXO32, YBX3, CMYA5, PDE4DIP, NRAP, PDLIM5, TNNT3, ATP1A2, OBSCN, CTD.2545M3.8
## PC_ 3
## Positive: MYH7B, TNNT1, ATP2A2, TPM3, XPO4, MYH7, TNNI1, USP54, MYBPC1, TECRL
## PLIN5, CD36, LINC00958, PRSS45, LGR5, ATP1B4, AQP7, BEND7, CFLAR, MYL2
## Negative: TNNT3, ATP2A1, TPM1, MYH1, MYH2, TNNI2, TNNC2, ABCA5, MYLPF, ACTN3
## PFKFB3, TRIM63, GGT7, IGFN1, ENO3, MYBPC2, LPIN1, FAM166B, CMYA5, POMT2
## PC_ 4
## Positive: NOVA1, ABCA8, APOD, TNXB, CFD, COL6A1, DCN, COL6A2, SMOC2, COL6A3
## HTRA3, MEG3, ADM, PRELP, GSN, SERPING1, ABCA9, LRP1, LTBP4, ADH1B
## Negative: HLA.B, EGFL7, HIF3A, HLA.E, B2M, CAV1, TMSB10, BTNL9, IFITM3, FABP4
## VWF, IGF2, RNASE1, ADGRF5, FLT1, CD74, EMCN, KLF2, XAF1, ITGA1
## PC_ 5
## Positive: NEXN, TECRL, PDK4, PFKFB3, TRIM63, ABCA5, POMT2, FBXO32, LPIN1, FEZ2
## YBX3, IL17D, CD36, NRAP, MYH7B, FAM134B, ZFAND5, NEAT1, UCP3, CCDC39
## Negative: ATP2A1, MYLPF, TRDN, GGT7, MYH2, MYH1, AGL, TPM2, TPM1, MYL1
## TNNT3, RYR3, ASB18, ATXN7L2, DISC1, AC007228.9, CASQ1, TNNI2, MYBPC2, LINC00958
DimHeatmap(object = skm, dims = 1, cells = 500, balanced = TRUE)
DimHeatmap(object = skm, dims = 2, cells = 500, balanced = TRUE)
DimHeatmap(object = skm, dims = 3, cells = 500, balanced = TRUE)
DimHeatmap(object = skm, dims = 1:12, cells = 500, balanced = TRUE)
The next step is to decide how many principal components to keep for further downstream analysis. This step is another filtering step, in this way we eliminate noisy genes and keep only most informative ones. PC selection can be viewed as identifying the true dimensionality of a dataset
Seurat uses a resampling test inspired by the "JackStraw" procedure. It randomly permutes a subset of the data (1% by default) and rerun PCA, constructing a ‘null distribution’ of gene scores, and repeat this procedure. We identify ‘significant’ PCs as those who have a strong enrichment of low p-value genes.
skm <- JackStraw(object = skm, num.replicate = 10)
skm <- ScoreJackStraw(object = skm, dims = 1:20)
JackStrawPlot(object = skm, dims = 1:20)
## Warning: Removed 301012 rows containing missing values (geom_point).
ElbowPlot(object = skm)
Seurat provides a visualization tool for comparing the distribution of p-values for each PC with a uniform distribution (dashed line). ‘Significant’ PCs will show a strong enrichment of genes with low p-values (solid curve above the dashed line). In this case it appears that PCs 1-5 are significant.
Having created the PCA plot we can now use known cell cycle gene markers in order to assign a cell cycle status to each cell in order to see if the cycling cells form a separate cluster. A list of cell cycle markers, from Tirosh et al, 2015, is loaded with Seurat. We can segregate this list into markers of G2/M phase and markers of S phase:
cc.genes
## $s.genes
## [1] "MCM5" "PCNA" "TYMS" "FEN1" "MCM2" "MCM4"
## [7] "RRM1" "UNG" "GINS2" "MCM6" "CDCA7" "DTL"
## [13] "PRIM1" "UHRF1" "MLF1IP" "HELLS" "RFC2" "RPA2"
## [19] "NASP" "RAD51AP1" "GMNN" "WDR76" "SLBP" "CCNE2"
## [25] "UBR7" "POLD3" "MSH2" "ATAD2" "RAD51" "RRM2"
## [31] "CDC45" "CDC6" "EXO1" "TIPIN" "DSCC1" "BLM"
## [37] "CASP8AP2" "USP1" "CLSPN" "POLA1" "CHAF1B" "BRIP1"
## [43] "E2F8"
##
## $g2m.genes
## [1] "HMGB2" "CDK1" "NUSAP1" "UBE2C" "BIRC5" "TPX2" "TOP2A"
## [8] "NDC80" "CKS2" "NUF2" "CKS1B" "MKI67" "TMPO" "CENPF"
## [15] "TACC3" "FAM64A" "SMC4" "CCNB2" "CKAP2L" "CKAP2" "AURKB"
## [22] "BUB1" "KIF11" "ANP32E" "TUBB4B" "GTSE1" "KIF20B" "HJURP"
## [29] "CDCA3" "HN1" "CDC20" "TTK" "CDC25C" "KIF2C" "RANGAP1"
## [36] "NCAPD2" "DLGAP5" "CDCA2" "CDCA8" "ECT2" "KIF23" "HMMR"
## [43] "AURKA" "PSRC1" "ANLN" "LBR" "CKAP5" "CENPE" "CTCF"
## [50] "NEK2" "G2E3" "GAS2L3" "CBX5" "CENPA"
s.genes <- cc.genes$s.genes
g2m.genes <- cc.genes$g2m.genes
First, we assign each cell a score, based on its expression of G2/M and S phase markers. These marker sets should be anticorrelated in their expression levels, and cells expressing neither are likely not cycling and in G1 phase.
skm <- CellCycleScoring(object = skm, s.features = s.genes, g2m.features = g2m.genes, set.ident = TRUE)
## Warning: The following features are not present in the object: TYMS, DTL,
## MLF1IP, WDR76, RAD51, RRM2, CDC45, CDC6, EXO1, not searching for symbol synonyms
## Warning: The following features are not present in the object: CDK1, UBE2C,
## TOP2A, NDC80, NUF2, FAM64A, AURKB, BUB1, HJURP, CDCA3, TTK, CDC25C, KIF2C,
## DLGAP5, CDCA2, KIF23, CENPE, GAS2L3, CENPA, not searching for symbol synonyms
# view cell cycle scores and phase assignments
head(x = skm[[]])
## orig.ident nCount_RNA nFeature_RNA percent.mito S.Score
## C00001_chimp chimp 254 217 0.000000000 -0.016113677
## C00004_human human 254 200 0.000000000 -0.008056839
## C00006_chimp chimp 249 216 0.000000000 -0.016198181
## C00007_chimp chimp 95 89 0.000000000 -0.010165400
## C00008_chimp chimp 265 228 0.000000000 -0.013278016
## C00009_human human 202 157 0.004950495 0.000000000
## G2M.Score Phase old.ident
## C00001_chimp -0.023571388 G1 chimp
## C00004_human -0.007857129 G1 human
## C00006_chimp -0.023695002 G1 chimp
## C00007_chimp -0.003304475 G1 chimp
## C00008_chimp -0.036256882 G1 chimp
## C00009_human -0.013888355 S human
table(skm[[]]$Phase)
##
## G1 G2M S Undecided
## 4193 770 702 3
skm@meta.data$old.ident<-as.factor(skm@active.ident)
Good news is that the vast majority of cells do not seem to be cycling, i.e. they are in the G1 phase. Around 1600 cells though seem to be in G2M / S phase, which is not that bad, but the most important is to check now that those ~1600 cells do not form a separate cluster but spread uniformly through the cell lineage clusters.
skm <- RunPCA(object = skm, features = VariableFeatures(object = skm), verbose = FALSE)
DimPlot(object = skm, pt.size=1.2)
Well, we do not seem to observe a separate cluster for the cycling cells on the PCA plot, the G2M and S cells seem to be spread across the PCA plot. What about tSNE, will we see the cycling cells cluster there?
Seurat uses graph-based clustering algorithm similar to Shared Nearest Neighbor (SNN). This method embeds cells in a graph structure - for example a K-nearest neighbor (KNN) graph, with edges drawn between cells with similar gene expression patterns, and then attempts to partition this graph into highly interconnected ‘quasi-cliques’ or ‘communities’. We first construct a KNN graph based on the euclidean distance in PCA space, and refine the edge weights between any two cells based on the shared overlap in their local neighborhoods (Jaccard similarity). To cluster the cells, we apply modularity optimization techniques such as the Louvain algorithm (default).
skm <- FindNeighbors(object = skm, dims = 1:5)
## Computing nearest neighbor graph
## Computing SNN
skm <- FindClusters(object = skm, resolution = 0.3)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 5668
## Number of edges: 169881
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9048
## Number of communities: 5
## Elapsed time: 0 seconds
Here we will also identify doublets using DoubletFinder tool:
#prior expectation is that 4% of cells are doublets
nExp <- round(ncol(skm) * 0.04)
skm <- doubletFinder_v3(skm, pN = 0.25, pK = 0.09, nExp = nExp, PCs = 1:5)
## Loading required package: fields
## Loading required package: spam
## Spam version 2.9-1 (2022-08-07) is loaded.
## Type 'help( Spam)' or 'demo( spam)' for a short introduction
## and overview of this package.
## Help for individual functions is also obtained by adding the
## suffix '.spam' to the function name, e.g. 'help( chol.spam)'.
##
## Attaching package: 'spam'
## The following objects are masked from 'package:base':
##
## backsolve, forwardsolve
## Loading required package: viridis
## Loading required package: viridisLite
##
## Try help(fields) to get started.
## Loading required package: KernSmooth
## KernSmooth 2.23 loaded
## Copyright M. P. Wand 1997-2009
## [1] "Creating 1889 artificial doublets..."
## [1] "Creating Seurat object..."
## [1] "Normalizing Seurat object..."
## [1] "Finding variable genes..."
## [1] "Scaling data..."
## Centering and scaling data matrix
## [1] "Running PCA..."
## [1] "Calculating PC distance matrix..."
## [1] "Computing pANN..."
## [1] "Classifying doublets.."
#save doublet finder prediction name
DF.name = colnames(skm@meta.data)[grepl("DF.classification", colnames(skm@meta.data))]
Here we perform dimensionality reduction with tSNE and color the identified clusters by the graph-based clustering algorithm.
opt_perp <- round(sqrt(length(skm@active.ident)),0)
opt_perp
## [1] 75
skm <- RunTSNE(object = skm, dims = 1:5, perplexity = opt_perp, check_duplicates = FALSE)
#skm <- RunUMAP(object = skm, reduction.use = "pca", dims = 1:20, min_dist = 0.75)
DimPlot(object = skm, reduction = 'tsne', pt.size=1.2, label=TRUE, label.size=8)
DimPlot(object = skm, reduction = 'tsne', group.by="orig.ident", pt.size=1.2)
DimPlot(object = skm, reduction = 'tsne', group.by="old.ident", pt.size=1.2)
DimPlot(object = skm, reduction = 'tsne', group.by=DF.name, pt.size=1.2)
DimPlot(object = skm, reduction = 'pca', pt.size=1.2)
It looks like human cells form two very distinct clusters, while the chimp cells are less obviously clustered into two blobs. The forths cluster where the human and chimp cells overlap might be some rare cell population or just poor quality cells, some more thorough analysis is needed here. Good news is that the cycling cells are spread homogeneously across all the clusters and do not form a separate cluster.
Let us check the numbers of cells in each cluster:
table(skm@active.ident)
##
## 0 1 2 3 4
## 2079 1175 1053 992 369
cell_assignment<-data.frame(CELL=names(skm@active.ident),CLUSTER=as.numeric(as.character(skm@active.ident)))
print(head(cell_assignment))
## CELL CLUSTER
## 1 C00001_chimp 3
## 2 C00004_human 2
## 3 C00006_chimp 0
## 4 C00007_chimp 0
## 5 C00008_chimp 3
## 6 C00009_human 1
write.table(cell_assignment,file="snRNAseq_cell_assignment.txt",col.names=TRUE,row.names=FALSE,quote=FALSE,sep="\t")
write.table(expr,file="snRNAseq_Human_Chimp_SKM_SeuratFiltered.txt",col.names=TRUE,row.names=TRUE,quote=FALSE,sep="\t")
cell_human_assignment<-cell_assignment[cell_assignment$CLUSTER==1 | cell_assignment$CLUSTER==2,]
cell_human_assignment<-cell_human_assignment[grepl("human",as.character(cell_human_assignment$CELL))==TRUE,]
print(head(cell_human_assignment))
## CELL CLUSTER
## 2 C00004_human 2
## 6 C00009_human 1
## 7 C00011_human 1
## 8 C00014_human 1
## 10 C00018_human 1
## 12 C00024_human 1
expr_human<-subset(expr,select=colnames(expr)[colnames(expr)%in%as.character(cell_human_assignment$CELL)])
write.table(cell_human_assignment,file="snRNAseq_cell_human_assignment.txt",col.names=TRUE,row.names=FALSE,quote=FALSE,sep="\t")
write.table(expr_human,file="snRNAseq_Human_SKM_SeuratFiltered.txt",col.names=TRUE,row.names=TRUE,quote=FALSE,sep="\t")
cell_chimp_assignment<-cell_assignment[cell_assignment$CLUSTER==0 | cell_assignment$CLUSTER==3,]
cell_chimp_assignment<-cell_chimp_assignment[grepl("chimp",as.character(cell_chimp_assignment$CELL))==TRUE,]
print(head(cell_chimp_assignment))
## CELL CLUSTER
## 1 C00001_chimp 3
## 3 C00006_chimp 0
## 4 C00007_chimp 0
## 5 C00008_chimp 3
## 9 C00015_chimp 0
## 11 C00021_chimp 0
expr_chimp<-subset(expr,select=colnames(expr)[colnames(expr)%in%as.character(cell_chimp_assignment$CELL)])
write.table(cell_chimp_assignment,file="snRNAseq_cell_chimp_assignment.txt",col.names=TRUE,row.names=FALSE,quote=FALSE,sep="\t")
write.table(expr_chimp,file="snRNAseq_Chimp_SKM_SeuratFiltered.txt",col.names=TRUE,row.names=TRUE,quote=FALSE,sep="\t")
Let us find markers for each cluster:
# find all markers of cluster 0,1,2,3,4
cluster0.markers <- FindMarkers(object = skm, ident.1 = 0, min.pct = 0.25)
head(x = cluster0.markers, n = 10)
## p_val avg_log2FC pct.1 pct.2 p_val_adj
## ABCA5 0.000000e+00 1.968778 0.785 0.393 0.000000e+00
## EMC10 0.000000e+00 -2.811481 0.285 0.754 0.000000e+00
## MYBPC1 1.058703e-218 -1.597010 0.428 0.785 1.871363e-214
## CTD.2545M3.8 2.749522e-201 -2.272073 0.148 0.559 4.860055e-197
## TRIM63 3.647386e-191 1.595983 0.669 0.354 6.447119e-187
## ACTN3 6.224073e-135 1.622434 0.511 0.239 1.100167e-130
## PFKFB3 3.476708e-131 2.009229 0.406 0.146 6.145429e-127
## FBXO32 8.752152e-121 1.405274 0.599 0.365 1.547030e-116
## MIR1.1HG 2.549475e-117 -1.534578 0.212 0.526 4.506453e-113
## TNNT1 1.195500e-113 -1.120919 0.482 0.744 2.113166e-109
cluster1.markers <- FindMarkers(object = skm, ident.1 = 1, min.pct = 0.25)
head(x = cluster1.markers, n = 10)
## p_val avg_log2FC pct.1 pct.2 p_val_adj
## EMC10 0.000000e+00 2.1940064 0.968 0.481 0.000000e+00
## CTD.2545M3.8 2.926971e-266 1.9639403 0.800 0.306 5.173714e-262
## ATP2A1 8.084699e-216 1.8381414 0.711 0.257 1.429051e-211
## TNNT3 6.022502e-199 1.2873853 0.926 0.601 1.064538e-194
## MYH2 7.416282e-172 1.8667194 0.644 0.255 1.310902e-167
## ACTA1 3.866663e-132 0.8419152 0.955 0.809 6.834714e-128
## MIR1.1HG 5.352912e-125 1.4144068 0.680 0.340 9.461808e-121
## RHOBTB1 1.056647e-124 1.9251502 0.402 0.114 1.867729e-120
## MYLPF 2.081148e-120 1.3669615 0.665 0.327 3.678637e-116
## CMYA5 4.758643e-98 1.0959816 0.721 0.407 8.411377e-94
cluster2.markers <- FindMarkers(object = skm, ident.1 = 2, min.pct = 0.25)
head(x = cluster2.markers, n = 10)
## p_val avg_log2FC pct.1 pct.2 p_val_adj
## MYBPC1 6.235324e-299 1.649348 0.946 0.587 1.102156e-294
## MYH7B 9.151235e-233 2.239518 0.594 0.137 1.617572e-228
## TNNT1 2.789422e-182 1.322267 0.919 0.586 4.930582e-178
## EMC10 5.267321e-178 1.013012 0.953 0.498 9.310516e-174
## ATP2A2 1.838596e-140 1.528834 0.646 0.249 3.249903e-136
## MYH7 1.782013e-123 1.403237 0.744 0.386 3.149886e-119
## TPM3 3.998609e-111 1.365055 0.570 0.217 7.067942e-107
## TNNT3 4.612614e-103 -1.464542 0.479 0.712 8.153256e-99
## ABCA5 2.143368e-91 -1.880241 0.320 0.586 3.788617e-87
## CD36 9.830687e-91 1.772018 0.293 0.073 1.737672e-86
cluster3.markers <- FindMarkers(object = skm, ident.1 = 3, min.pct = 0.25)
head(x = cluster3.markers, n = 10)
## p_val avg_log2FC pct.1 pct.2 p_val_adj
## TNNT3 2.665702e-94 -1.4431972 0.448 0.715 4.711895e-90
## PLIN5 7.330228e-76 1.4533224 0.378 0.134 1.295691e-71
## TNNT1 8.462649e-71 0.7023938 0.865 0.602 1.495858e-66
## USP54 4.597088e-70 1.5238673 0.354 0.127 8.125813e-66
## TRDN 1.799925e-60 0.6253092 0.887 0.706 3.181548e-56
## EMC10 8.829050e-55 -1.8935845 0.506 0.598 1.560623e-50
## ATP2A1 1.564240e-47 -1.7363414 0.170 0.390 2.764950e-43
## XPO4 7.842923e-43 1.3538670 0.255 0.099 1.386315e-38
## CMYA5 6.437480e-41 -1.1371978 0.309 0.506 1.137889e-36
## ACADVL 1.099505e-37 0.9489502 0.399 0.208 1.943485e-33
cluster4.markers <- FindMarkers(object = skm, ident.1 = 4, min.pct = 0.25)
head(x = cluster4.markers, n = 10)
## p_val avg_log2FC pct.1 pct.2 p_val_adj
## TTN 1.855165e-152 -1.967681 0.599 0.988 3.279189e-148
## NEAT1 1.024693e-113 -1.269650 0.846 0.994 1.811247e-109
## GSN 2.438149e-111 3.181977 0.393 0.067 4.309671e-107
## APOD 2.084673e-109 3.536156 0.358 0.055 3.684868e-105
## MEG3 1.267216e-101 2.633715 0.285 0.035 2.239930e-97
## NEB 1.892696e-100 -1.782932 0.444 0.903 3.345529e-96
## CFD 7.731879e-98 3.336105 0.266 0.032 1.366687e-93
## DCN 3.038665e-89 3.057368 0.282 0.041 5.371143e-85
## TRDN 1.696371e-75 -1.989270 0.263 0.771 2.998506e-71
## EMC10 5.303345e-58 -2.827977 0.176 0.610 9.374193e-54
#cluster5.markers <- FindMarkers(object = skm, ident.1 = 5, min.pct = 0.25)
#head(x = cluster5.markers, n = 10)
#cluster6.markers <- FindMarkers(object = skm, ident.1 = 6, min.pct = 0.25)
#head(x = cluster6.markers, n = 10)
..and display violin plots for a couple of cluster markers:
VlnPlot(object = skm, features = c("MYBPC1", "MYH7B","ABCA5"))
Now we can find markers for every cluster compared to all remaining cells, report only the positive ones:
skm.markers <- FindAllMarkers(object = skm, only.pos = TRUE, min.pct = 0.25, test.use = "roc", logfc.threshold = 0.5)
## Calculating cluster 0
## Calculating cluster 1
## Calculating cluster 2
## Calculating cluster 3
## Calculating cluster 4
skm.markers[order(-skm.markers$avg_log2FC),]
## myAUC avg_diff power avg_log2FC pct.1 pct.2 cluster gene
## MYH7B 0.734 1.5523153 0.468 2.2395176 0.594 0.137 2 MYH7B
## EMC10 0.898 1.5207693 0.796 2.1940064 0.968 0.481 1 EMC10
## ABCA5 0.792 1.3646530 0.584 1.9687781 0.785 0.393 0 ABCA5
## CTD.2545M3.8 0.794 1.3612997 0.588 1.9639403 0.800 0.306 1 CTD.2545M3.8
## MYH2 0.722 1.2939113 0.444 1.8667194 0.644 0.255 1 MYH2
## ATP2A1 0.753 1.2741025 0.506 1.8381414 0.711 0.257 1 ATP2A1
## MYBPC1 0.857 1.1432410 0.714 1.6493482 0.946 0.587 2 MYBPC1
## TRIM63 0.716 1.1062508 0.432 1.5959826 0.669 0.354 0 TRIM63
## ATP2A2 0.707 1.0597073 0.414 1.5288344 0.646 0.249 2 ATP2A2
## MIR1.1HG 0.701 0.9803921 0.402 1.4144068 0.680 0.340 1 MIR1.1HG
## MYH7 0.713 0.9726501 0.426 1.4032375 0.744 0.386 2 MYH7
## TNNT1 0.778 0.9165255 0.556 1.3222668 0.919 0.586 2 TNNT1
## TNNT3 0.779 0.8923475 0.558 1.2873853 0.926 0.601 1 TNNT3
## EMC101 0.770 0.7021661 0.540 1.0130115 0.953 0.498 2 EMC10
## ACTA1 0.731 0.5835712 0.462 0.8419152 0.955 0.809 1 ACTA1
#skm.markers %>% group_by(cluster) %>% top_n(n = 5, wt = avg_logFC)
Looks like the fifth cluster is special as it is depleted for the TTN gene which is highly expressed everywhere but in the fifth cluster. We can also color the cells on tSNE plot by the marker genes the predominantly express:
#FeaturePlot(object = skm, features = c("ABCA5","TRIM63","EMC10","ATP2A1","MYH7", "MYH7B","ACTN3","MYH1","MYH2"))
#FeaturePlot(object = skm, features = c("TTN","MT.CO1","EMC10","ATP2A1","ATP2A2","MYH7", "MYH7B","MYH1","MYH2"))
FeaturePlot(object = skm, features = c("EMC10","ATP2A2","MYH7","MYH7B","ATP2A1","MYH1","MYH2","TTN","DCN"),min.cutoff=c(NA,NA,NA,NA,NA,NA,NA,"q1",NA))
We can also plot the top 20 markers (or all markers if less than 20) for each cluster:
skm.markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_log2FC) -> top10
DoHeatmap(object = skm, features = top10$gene) + NoLegend()
The clustering is not very transparent implying that more thorough filtering is needed, alternatively, the cells look biologically very similar. We can assign final labels to clusters based on their gene markers:
new.cluster.ids <- c("Chimp Fast Twitch", "Human Fast Twitch", "Human Slow Twitch", "Chimp Slow Twitch", "Unclear Cells", "Poor Quality Cells")
names(x = new.cluster.ids) <- levels(x = skm)
skm <- RenameIdents(object = skm, new.cluster.ids)
## Warning: Cannot find identity NA
DimPlot(object = skm, reduction = 'tsne', label = TRUE, pt.size=1, label.size=6) + NoLegend()
Now we will check proportions of cells expressing skeletal muscle slow and fast twitch gene markers. We will do it separately for humans and chimps. From the previous analysis, it seems like the difference between slow and fast twitch is more pronounced in humans compared to chimps where cells are less heterogeneous in sense of traditional fast and slow twitch gene markers like MYH2 and MYH7. Let us with selecting only the genes and cells in the original expression matrix that passed the QC and split the matrix into human and chimp expression matrices:
qc_cells<-colnames(skm@assays$RNA@counts)
qc_genes<-rownames(skm@assays$RNA@counts)
expr_qc<-subset(expr,select=qc_cells)
expr_qc<-expr_qc[match(qc_genes,rownames(expr_qc)),]
expr_qc[1:5,1:5]
## C00001_chimp C00004_human C00006_chimp C00007_chimp C00008_chimp
## RP11.34P13.7 0 0 0 0 0
## FO538757.2 0 0 0 0 0
## AP006222.2 0 0 0 0 0
## RP4.669L17.10 0 0 0 0 0
## RP5.857K21.4 0 0 0 0 0
expr_qc_human<-subset(expr_qc,select=colnames(expr_qc)[grepl("_human",colnames(expr_qc))])
expr_qc_human[1:5,1:5]
## C00004_human C00009_human C00011_human C00014_human C00018_human
## RP11.34P13.7 0 0 0 0 0
## FO538757.2 0 0 0 0 0
## AP006222.2 0 0 0 0 0
## RP4.669L17.10 0 0 0 0 0
## RP5.857K21.4 0 0 0 0 0
expr_qc_chimp<-subset(expr_qc,select=colnames(expr_qc)[grepl("_chimp",colnames(expr_qc))])
expr_qc_chimp[1:5,1:5]
## C00001_chimp C00006_chimp C00007_chimp C00008_chimp C00015_chimp
## RP11.34P13.7 0 0 0 0 0
## FO538757.2 0 0 0 0 0
## AP006222.2 0 0 0 0 0
## RP4.669L17.10 0 0 0 0 0
## RP5.857K21.4 0 0 0 0 0
Now we are going to select only genes that are gene markers for type 1 and type 2 muscle fibers. We will select: 1) MYH1, MYH2 and ATP2A1 as gene markers for fast twitch fibers, and 2) MYH7B, MYH7 and ATP2A2 as gene markers for slow twitch fibers.
expr_qc_human_fast<-expr_qc_human[match(c("MYH1","MYH2","ATP2A1"),rownames(expr_qc_human)),]
expr_qc_human_fast[1:3,1:5]
## C00004_human C00009_human C00011_human C00014_human C00018_human
## MYH1 0 0 0 0 4
## MYH2 0 0 1 3 1
## ATP2A1 0 1 0 2 0
expr_qc_human_slow<-expr_qc_human[match(c("MYH7B","MYH7","ATP2A2"),rownames(expr_qc_human)),]
expr_qc_human_slow[1:3,1:5]
## C00004_human C00009_human C00011_human C00014_human C00018_human
## MYH7B 2 0 0 0 0
## MYH7 1 0 0 0 2
## ATP2A2 1 1 0 0 2
expr_qc_human_fast_slow<-expr_qc_human[match(c("MYH1","MYH2","ATP2A1","MYH7B","MYH7","ATP2A2"),rownames(expr_qc_human)),]
expr_qc_human_fast_slow[1:6,1:6]
## C00004_human C00009_human C00011_human C00014_human C00018_human
## MYH1 0 0 0 0 4
## MYH2 0 0 1 3 1
## ATP2A1 0 1 0 2 0
## MYH7B 2 0 0 0 0
## MYH7 1 0 0 0 2
## ATP2A2 1 1 0 0 2
## C00024_human
## MYH1 3
## MYH2 0
## ATP2A1 0
## MYH7B 0
## MYH7 2
## ATP2A2 1
expr_qc_chimp_fast<-expr_qc_chimp[match(c("MYH1","MYH2","ATP2A1"),rownames(expr_qc_chimp)),]
expr_qc_chimp_fast[1:3,1:5]
## C00001_chimp C00006_chimp C00007_chimp C00008_chimp C00015_chimp
## MYH1 1 0 0 0 0
## MYH2 0 1 1 1 0
## ATP2A1 0 0 1 1 0
expr_qc_chimp_slow<-expr_qc_chimp[match(c("MYH7B","MYH7","ATP2A2"),rownames(expr_qc_chimp)),]
expr_qc_chimp_slow[1:3,1:5]
## C00001_chimp C00006_chimp C00007_chimp C00008_chimp C00015_chimp
## MYH7B 0 0 0 1 1
## MYH7 0 1 0 0 0
## ATP2A2 1 0 0 2 0
expr_qc_chimp_fast_slow<-expr_qc_chimp[match(c("MYH1","MYH2","ATP2A1","MYH7B","MYH7","ATP2A2"),rownames(expr_qc_chimp)),]
expr_qc_chimp_fast_slow[1:6,1:6]
## C00001_chimp C00006_chimp C00007_chimp C00008_chimp C00015_chimp
## MYH1 1 0 0 0 0
## MYH2 0 1 1 1 0
## ATP2A1 0 0 1 1 0
## MYH7B 0 0 0 1 1
## MYH7 0 1 0 0 0
## ATP2A2 1 0 0 2 0
## C00021_chimp
## MYH1 0
## MYH2 1
## ATP2A1 0
## MYH7B 0
## MYH7 0
## ATP2A2 1
Next, we are going to check how many cells in human express both type 1 and type 2 marker genes. For this purpose, we require that the mean expression of both slow and fast twitch gene markers was higher than mean expression of all markers across all cells human, which is close to 1, i.e. something like 0.73. We need to do it in this way because there is apparently a difference in total gne expression of those slow and fast twitch gene markers between humans and chimps:
boxplot(log10(as.numeric(colMeans(expr_qc_human_fast_slow))+1),log10(as.numeric(colMeans(expr_qc_chimp_fast_slow))+1),ylab="log10 expression of fast+slow twitch markers",names=c("HUMAN","CHIMP"))
mean(as.numeric(colMeans(expr_qc_human_fast_slow)))
## [1] 0.7329309
mean(as.numeric(colMeans(expr_qc_chimp_fast_slow)))
## [1] 0.3381302
Let us now count how many cells express both type 1 and type 2 fibers on the expression level above the average expression level:
expr_qc_human_fast_mean<-colMeans(expr_qc_human_fast)
expr_qc_human_fast_mean_filtered<-expr_qc_human_fast_mean[as.numeric(expr_qc_human_fast_mean)>=mean(as.numeric(colMeans(expr_qc_human_fast_slow)))]
length(expr_qc_human_fast_mean_filtered)
## [1] 810
head(expr_qc_human_fast_mean_filtered)
## C00014_human C00018_human C00024_human C00029_human C00046_human C00052_human
## 1.666667 1.666667 1.000000 1.000000 1.666667 1.000000
expr_qc_human_slow_mean<-colMeans(expr_qc_human_slow)
expr_qc_human_slow_mean_filtered<-expr_qc_human_slow_mean[as.numeric(expr_qc_human_slow_mean)>=mean(as.numeric(colMeans(expr_qc_human_fast_slow)))]
length(expr_qc_human_slow_mean_filtered)
## [1] 817
head(expr_qc_human_slow_mean_filtered)
## C00004_human C00018_human C00024_human C00031_human C00033_human C00068_human
## 1.333333 1.333333 1.000000 1.666667 2.666667 1.333333
length(intersect(names(expr_qc_human_fast_mean_filtered),names(expr_qc_human_slow_mean_filtered)))
## [1] 91
head(intersect(names(expr_qc_human_fast_mean_filtered),names(expr_qc_human_slow_mean_filtered)))
## [1] "C00018_human" "C00024_human" "C00105_human" "C00422_human" "C00471_human"
## [6] "C00489_human"
expr_qc_human_fast_slow[,match(intersect(names(expr_qc_human_fast_mean_filtered),names(expr_qc_human_slow_mean_filtered)),colnames(expr_qc_human_fast_slow))][1:6,1:6]
## C00018_human C00024_human C00105_human C00422_human C00471_human
## MYH1 4 3 1 0 6
## MYH2 1 0 3 1 0
## ATP2A1 0 0 0 2 2
## MYH7B 0 0 0 1 0
## MYH7 2 2 1 1 3
## ATP2A2 2 1 2 1 0
## C00489_human
## MYH1 1
## MYH2 2
## ATP2A1 0
## MYH7B 0
## MYH7 2
## ATP2A2 1
Now we will perform similar things for the chimp expression data:
expr_qc_chimp_fast_mean<-colMeans(expr_qc_chimp_fast)
expr_qc_chimp_fast_mean_filtered<-expr_qc_chimp_fast_mean[as.numeric(expr_qc_chimp_fast_mean)>=mean(as.numeric(colMeans(expr_qc_chimp_fast_slow)))]
length(expr_qc_chimp_fast_mean_filtered)
## [1] 764
head(expr_qc_chimp_fast_mean_filtered)
## C00007_chimp C00008_chimp C00030_chimp C00034_chimp C00039_chimp C00044_chimp
## 0.6666667 0.6666667 1.0000000 1.0000000 1.0000000 0.6666667
expr_qc_chimp_slow_mean<-colMeans(expr_qc_chimp_slow)
expr_qc_chimp_slow_mean_filtered<-expr_qc_chimp_slow_mean[as.numeric(expr_qc_chimp_slow_mean)>=mean(as.numeric(colMeans(expr_qc_chimp_fast_slow)))]
length(expr_qc_chimp_slow_mean_filtered)
## [1] 911
head(expr_qc_chimp_slow_mean_filtered)
## C00008_chimp C00038_chimp C00047_chimp C00050_chimp C00051_chimp C00066_chimp
## 1.0000000 0.6666667 1.6666667 0.6666667 0.6666667 1.0000000
length(intersect(names(expr_qc_chimp_fast_mean_filtered),names(expr_qc_chimp_slow_mean_filtered)))
## [1] 185
head(intersect(names(expr_qc_chimp_fast_mean_filtered),names(expr_qc_chimp_slow_mean_filtered)))
## [1] "C00008_chimp" "C00142_chimp" "C00175_chimp" "C00208_chimp" "C00218_chimp"
## [6] "C00227_chimp"
expr_qc_chimp_fast_slow[,match(intersect(names(expr_qc_chimp_fast_mean_filtered),names(expr_qc_chimp_slow_mean_filtered)),colnames(expr_qc_chimp_fast_slow))][1:6,1:6]
## C00008_chimp C00142_chimp C00175_chimp C00208_chimp C00218_chimp
## MYH1 0 1 2 0 1
## MYH2 1 1 0 0 0
## ATP2A1 1 0 0 2 2
## MYH7B 1 0 0 0 0
## MYH7 0 1 1 1 1
## ATP2A2 2 1 4 2 2
## C00227_chimp
## MYH1 1
## MYH2 1
## ATP2A1 0
## MYH7B 1
## MYH7 1
## ATP2A2 1
Next we are going to visualize the fractions of the type 1 and type 2 fibers for human. By type 1 nuclei we understand nuclei that express type 1 genes above the average level but type 2 genes below the average level. In contrast, by type 2 nuclei we understand nuclei that express type 2 genes above the average level but type 1 genes below the average level:
dim(subset(expr_qc_human_fast_slow,select=colnames(expr_qc_human_fast_slow)[as.numeric(expr_qc_human_fast_mean)>= mean(as.numeric(colMeans(expr_qc_human_fast_slow)))& as.numeric(expr_qc_human_slow_mean)<mean(as.numeric(colMeans(expr_qc_human_fast_slow)))]))[2]
## [1] 719
dim(subset(expr_qc_human_fast_slow,select=colnames(expr_qc_human_fast_slow)[as.numeric(expr_qc_human_fast_mean)<mean(as.numeric(colMeans(expr_qc_human_fast_slow))) & as.numeric(expr_qc_human_slow_mean)>=mean(as.numeric(colMeans(expr_qc_human_fast_slow)))]))[2]
## [1] 726
dim(subset(expr_qc_human_fast_slow,select=colnames(expr_qc_human_fast_slow)[as.numeric(expr_qc_human_fast_mean)>=mean(as.numeric(colMeans(expr_qc_human_fast_slow))) & as.numeric(expr_qc_human_slow_mean)>=mean(as.numeric(colMeans(expr_qc_human_fast_slow)))]))[2]
## [1] 91
slices_human <- c(dim(subset(expr_qc_human_fast_slow,select=colnames(expr_qc_human_fast_slow)[as.numeric(expr_qc_human_fast_mean)>=mean(as.numeric(colMeans(expr_qc_human_fast_slow))) & as.numeric(expr_qc_human_slow_mean)<mean(as.numeric(colMeans(expr_qc_human_fast_slow)))]))[2], dim(subset(expr_qc_human_fast_slow,select=colnames(expr_qc_human_fast_slow)[as.numeric(expr_qc_human_fast_mean)<mean(as.numeric(colMeans(expr_qc_human_fast_slow))) & as.numeric(expr_qc_human_slow_mean)>=mean(as.numeric(colMeans(expr_qc_human_fast_slow)))]))[2], dim(subset(expr_qc_human_fast_slow,select=colnames(expr_qc_human_fast_slow)[as.numeric(expr_qc_human_fast_mean)>=mean(as.numeric(colMeans(expr_qc_human_fast_slow))) & as.numeric(expr_qc_human_slow_mean)>=mean(as.numeric(colMeans(expr_qc_human_fast_slow)))]))[2])
lbls_human <- c("Fast (47%)", "Slow (47%)", "Both (6%)")
pie(slices_human, labels = lbls_human, main="Human")
dim(subset(expr_qc_chimp_fast_slow,select=colnames(expr_qc_chimp_fast_slow)[as.numeric(expr_qc_chimp_fast_mean)>=mean(as.numeric(colMeans(expr_qc_chimp_fast_slow))) & as.numeric(expr_qc_chimp_slow_mean)<mean(as.numeric(colMeans(expr_qc_chimp_fast_slow)))]))[2]
## [1] 579
dim(subset(expr_qc_chimp_fast_slow,select=colnames(expr_qc_chimp_fast_slow)[as.numeric(expr_qc_chimp_fast_mean)<mean(as.numeric(colMeans(expr_qc_chimp_fast_slow))) & as.numeric(expr_qc_chimp_slow_mean)>=mean(as.numeric(colMeans(expr_qc_chimp_fast_slow)))]))[2]
## [1] 726
dim(subset(expr_qc_chimp_fast_slow,select=colnames(expr_qc_chimp_fast_slow)[as.numeric(expr_qc_chimp_fast_mean)>=mean(as.numeric(colMeans(expr_qc_chimp_fast_slow))) & as.numeric(expr_qc_chimp_slow_mean)>=mean(as.numeric(colMeans(expr_qc_chimp_fast_slow)))]))[2]
## [1] 185
slices_chimp <- c(dim(subset(expr_qc_chimp_fast_slow,select=colnames(expr_qc_chimp_fast_slow)[as.numeric(expr_qc_chimp_fast_mean)>=mean(as.numeric(colMeans(expr_qc_chimp_fast_slow))) & as.numeric(expr_qc_chimp_slow_mean)<mean(as.numeric(colMeans(expr_qc_chimp_fast_slow)))]))[2], dim(subset(expr_qc_chimp_fast_slow,select=colnames(expr_qc_chimp_fast_slow)[as.numeric(expr_qc_chimp_fast_mean)<mean(as.numeric(colMeans(expr_qc_chimp_fast_slow))) & as.numeric(expr_qc_chimp_slow_mean)>=mean(as.numeric(colMeans(expr_qc_chimp_fast_slow)))]))[2], dim(subset(expr_qc_chimp_fast_slow,select=colnames(expr_qc_chimp_fast_slow)[as.numeric(expr_qc_chimp_fast_mean)>=mean(as.numeric(colMeans(expr_qc_chimp_fast_slow))) & as.numeric(expr_qc_chimp_slow_mean)>=mean(as.numeric(colMeans(expr_qc_chimp_fast_slow)))]))[2])
lbls_chimp <- c("Fast (39%)", "Slow (49%)", "Both (12%)")
pie(slices_chimp, labels = lbls_chimp, main="Chimp")
Now we will save the working environment for using it later:
save.image(file = "snRNAseq.RData")
sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=sv_SE.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=sv_SE.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=sv_SE.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=sv_SE.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] KernSmooth_2.23-20 fields_14.1 viridis_0.6.2
## [4] viridisLite_0.4.0 spam_2.9-1 DoubletFinder_2.0.3
## [7] matrixStats_0.62.0 data.table_1.14.2 dplyr_1.0.9
## [10] sp_1.4-7 SeuratObject_4.1.0 Seurat_4.1.1
##
## loaded via a namespace (and not attached):
## [1] Rtsne_0.16 colorspace_2.0-3 deldir_1.0-6
## [4] ellipsis_0.3.2 ggridges_0.5.3 rstudioapi_0.13
## [7] spatstat.data_2.2-0 farver_2.1.0 leiden_0.4.2
## [10] listenv_0.8.0 ggrepel_0.9.1 fansi_1.0.3
## [13] codetools_0.2-18 splines_4.1.0 knitr_1.39
## [16] polyclip_1.10-0 jsonlite_1.8.0 ica_1.0-2
## [19] cluster_2.1.2 png_0.1-7 rgeos_0.5-9
## [22] uwot_0.1.11 shiny_1.7.1 sctransform_0.3.3
## [25] spatstat.sparse_2.1-1 compiler_4.1.0 httr_1.4.3
## [28] assertthat_0.2.1 Matrix_1.3-4 fastmap_1.1.0
## [31] lazyeval_0.2.2 limma_3.48.3 cli_3.3.0
## [34] later_1.3.0 htmltools_0.5.2 tools_4.1.0
## [37] dotCall64_1.0-2 igraph_1.3.1 gtable_0.3.0
## [40] glue_1.6.2 RANN_2.6.1 reshape2_1.4.4
## [43] maps_3.4.0 Rcpp_1.0.8.3 scattermore_0.8
## [46] jquerylib_0.1.4 vctrs_0.4.1 nlme_3.1-152
## [49] progressr_0.10.0 lmtest_0.9-40 spatstat.random_2.2-0
## [52] xfun_0.31 stringr_1.4.0 globals_0.15.0
## [55] mime_0.12 miniUI_0.1.1.1 lifecycle_1.0.1
## [58] irlba_2.3.5 goftest_1.2-3 future_1.25.0
## [61] MASS_7.3-54 zoo_1.8-10 scales_1.2.0
## [64] spatstat.core_2.4-4 promises_1.2.0.1 spatstat.utils_2.3-1
## [67] parallel_4.1.0 RColorBrewer_1.1-3 yaml_2.3.5
## [70] reticulate_1.25 pbapply_1.5-0 gridExtra_2.3
## [73] ggplot2_3.3.6 sass_0.4.1 rpart_4.1-15
## [76] stringi_1.7.6 highr_0.9 rlang_1.0.4
## [79] pkgconfig_2.0.3 evaluate_0.15 lattice_0.20-44
## [82] tensor_1.5 ROCR_1.0-11 purrr_0.3.4
## [85] labeling_0.4.2 patchwork_1.1.1 htmlwidgets_1.5.4
## [88] cowplot_1.1.1 tidyselect_1.1.2 parallelly_1.31.1
## [91] RcppAnnoy_0.0.19 plyr_1.8.7 magrittr_2.0.3
## [94] R6_2.5.1 generics_0.1.2 DBI_1.1.2
## [97] withr_2.5.0 mgcv_1.8-36 pillar_1.7.0
## [100] fitdistrplus_1.1-8 survival_3.2-11 abind_1.4-5
## [103] tibble_3.1.7 future.apply_1.9.0 crayon_1.5.1
## [106] utf8_1.2.2 spatstat.geom_2.4-0 plotly_4.10.0
## [109] rmarkdown_2.14 grid_4.1.0 digest_0.6.29
## [112] xtable_1.8-4 tidyr_1.2.0 httpuv_1.6.5
## [115] munsell_0.5.0 bslib_0.3.1